Abstract: Time consistency is an essential requirement in risk-sensitive optimal control problems for making rational decisions. An optimization problem is time consistent if its solution policy does not depend on the time at which the optimization problem is solved. A dynamic risk measure, on the other hand, is time consistent if an outcome that is considered less risky at a future stage is also considered less risky at the current stage. In this paper, we study the time consistency of risk constrained problems in which the risk metric is time consistent. From the Bellman optimality condition in [1], we establish an analytical “risk-to-go” that results in a time consistent optimal policy. Finally, we demonstrate the effectiveness of the analytical solution by solving Haviv’s counter-example [2] in time inconsistent planning.

I. Introduction 

Stochastic Optimal Control (SOC) is concerned with sequential decision-making under uncertainty. Consider a dynamical process that can be influenced by exogenous noise as well as by decisions made at every time step. The decision maker wants to optimize the behavior of the dynamical system over a certain time horizon by finding a policy that maps the history of states to optimal actions.

In this SOC setup, we solve an optimization problem at the present time and determine an optimal policy that maps states to actions so as to minimize the cumulative cost. To obtain rational decisions, we seek a time consistent control policy, that is, a policy that is optimal both for the SOC problem at present and for the tail subproblems at all subsequent time steps. If such a policy exists, the SOC problem is also said to be time consistent. This property can be stated as follows. The decision maker formulates an optimization problem at time k = 0 that yields a sequence of optimal decision rules for k = 0 to k = N. Then, at the next time step k = 1, he/she formulates a new problem starting at k = 1 that yields a new sequence of optimal decision rules from time steps k = 1 to N. The sequence of policies is time consistent if the strategies obtained when solving the original problem at time k = 0 remain optimal for all subsequent problems.

Recently, the concept of time consistency has also been extended to the context of risk measures [3], [4], [5], [6], [7], [8], [9]. In these papers, the authors formally defined the notion of time consistency of risks, provided examples of time consistent risk measures, and axiomatically justified that this property is necessary to develop rational risk assessments in stochastic processes. In [10], the author showed that expectation and worst-case risk are the only time consistent coherent risks, and the authors in [11] developed a necessary and sufficient condition for time consistent risk measures. Furthermore, in [12], the authors provided tight approximations of time inconsistent coherent risks by lower- or upper-bounding them with time consistent metrics.

Note that common examples of time consistent risk measures include expectation and entropic risk measures. From [11], by posing an unconstrained stochastic optimal control problem with a time consistent risk measure, one obtains a time consistent solution policy by dynamic programming. However, [13] shows that a risk constrained problem is not necessarily time consistent even if both the objective function and the constraints are time consistent risk measures. This leads to undesirable outcomes: for example, when a decision maker seeks to minimize expected loss subject to a risk constraint, the optimal policy at present may become infeasible in the future when the risk is re-evaluated.

While time consistent policies are essential to ensure rational decisions, it has been pointed out in [2], [14], [15] that in a constrained SOC setup, time consistency is not necessarily satisfied by an optimal policy. There are several sufficient conditions that guarantee time consistency for specific constrained SOC problems. In [16], the author provided a sufficient condition for time consistency in deterministic optimal control problems. Also, [13] showed that the risk constrained problem is time consistent if the risk measures are optimality consistent, i.e., any constraints that are feasible at present will also be feasible in the future. Furthermore, [2], [14] argued that an SOC problem is time consistent if the constraints are satisfied along every sample history path. However, the above sufficient conditions are either restricted to a small subclass of SOC problems or require exponential computational complexity to verify. In contrast, the results in this paper provide a simple, analytic sufficient condition for time consistency for a general class of risk constrained SOC problems.

The contributions of this paper are three-fold.

• First, we formulate a risk constrained SOC problem 
with time consistent risk measures and show that it can 
be solved by dynamic programming techniques (in the 
augmented action space). 

• Second, by reformulating the above problem into an
augmented Markov decision problem, we develop an
analytical solution for the “risk-to-go” that results in a time
consistent optimal control policy.

• Third, we illustrate the effectiveness of this analytical 
method by solving for a time consistent optimal policy 
to Haviv’s “squander or save” counter-example [2] on 
time-inconsistent planning. 


The rest of the paper is organized as follows. In Section II we provide a review of the theory of risk metrics and Markov decision processes. In Section III we provide an analytical solution to the risk-to-go update that yields a time consistent optimal control policy. In Section IV we further justify our solution method by solving for a time consistent optimal policy to the “squander or save” problem. Finally, the conclusion and future work are discussed in Section V.

II. Preliminaries 

In this section we provide some background on dynamic, time-consistent risk metrics and risk constrained SOC problems, on which we will rely extensively later in the paper.

A. Markov Decision Processes 

A finite Markov Decision Process (MDP) is a four-tuple $(S, U, Q, U(\cdot))$, where $S$, the state space, is a finite set; $U$, the control space, is a finite set; for every $x \in S$, $U(x) \subseteq U$ is a nonempty set which represents the set of admissible controls when the system state is $x$; and, finally, $Q(\cdot \mid x, u)$ (the transition probability) is a conditional probability on $S$ given the set of admissible state-control pairs, i.e., the set of pairs $(x, u)$ where $x \in S$ and $u \in U(x)$.

Define the space $H_k$ of admissible histories up to time $k$ by $H_k = H_{k-1} \times S \times U$, for $k \geq 1$, and $H_0 = S$. A generic element $h_{0,k} \in H_k$ is of the form $h_{0,k} = (x_0, u_0, \ldots, x_{k-1}, u_{k-1}, x_k)$. Let $\Pi$ be the set of all deterministic policies with the property that at each time $k$ the control is a function of $h_{0,k}$. In other words,

$$\Pi := \Big\{ \{\pi_0 : H_0 \to U,\ \pi_1 : H_1 \to U, \ldots\} \ \Big|\ \pi_k(h_{0,k}) \in U(x_k) \text{ for all } h_{0,k} \in H_k,\ k \geq 0 \Big\}.$$

B. Dynamic, Time-consistent, Risk Measures 

Consider a probability space $(\Omega, \mathcal{F}, P)$, a filtration $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots \subset \mathcal{F}_N \subset \mathcal{F}$, and an adapted sequence of random variables $Z_k$, $k \in \{0, \ldots, N\}$. We assume that $\mathcal{F}_0 = \{\Omega, \emptyset\}$, i.e., $Z_0$ is deterministic. In this paper we interpret the variables $Z_k$ as stage-wise costs. For each $k \in \{1, \ldots, N\}$, define the space of random variables with finite $p$th order moment as $\mathcal{Z}_k := L_p(\Omega, \mathcal{F}_k, P)$; also, let $\mathcal{Z}_{k,N} := \mathcal{Z}_k \times \cdots \times \mathcal{Z}_N$.

A dynamic risk measure is a sequence of monotone mappings $\rho_{k,N} : \mathcal{Z}_{k,N} \to \mathcal{Z}_k$, $k \in \{0, \ldots, N\}$. Roughly speaking, a dynamic risk measure is time consistent if, when a $Z$ cost sequence is deemed less risky than a $W$ cost sequence from the perspective of a future time $k$, and both sequences yield identical costs from the current time $l$ to the future time $k$, then the $Z$ sequence is deemed less risky at the current time $l$ as well. We refer to [11] for a formal definition of time consistency. It turns out that dynamic, time-consistent risk metrics can be constructed by “compounding” coherent one-step conditional risk measures, which are defined as follows.

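Definition II.1 (Coherent One-step Conditional Risk Measures). A coherent one-step conditional risk measure is a mapping $\rho_k : \mathcal{Z}_{k+1} \to \mathcal{Z}_k$, $k \in \{0, \ldots, N\}$, with the following four properties:

• Convexity: $\rho_k(\lambda Z + (1 - \lambda) W) \leq \lambda \rho_k(Z) + (1 - \lambda) \rho_k(W)$, $\forall \lambda \in [0, 1]$ and $Z, W \in \mathcal{Z}_{k+1}$;

• Monotonicity: if $Z \leq W$ then $\rho_k(Z) \leq \rho_k(W)$, $\forall Z, W \in \mathcal{Z}_{k+1}$;

• Translation invariance: $\rho_k(Z + W) = Z + \rho_k(W)$, $\forall Z \in \mathcal{Z}_k$ and $W \in \mathcal{Z}_{k+1}$;

• Positive homogeneity: $\rho_k(\lambda Z) = \lambda \rho_k(Z)$, $\forall Z \in \mathcal{Z}_{k+1}$ and $\lambda \geq 0$.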

The compositional structure of dynamic, time-consistent risk metrics is then characterized by the following theorem.

Theorem II.2 (Dynamic, Time-consistent Risk Metrics [11]). Consider, for each $k \in \{0, \ldots, N\}$, the mappings $\rho_{k,N} : \mathcal{Z}_{k,N} \to \mathcal{Z}_k$ defined as

$$\rho_{k,N} = Z_k + \rho_k\big(Z_{k+1} + \rho_{k+1}(Z_{k+2} + \ldots + \rho_{N-2}(Z_{N-1} + \rho_{N-1}(Z_N)) \ldots )\big), \qquad (1)$$

where the $\rho_k$’s are coherent one-step conditional risk measures. Then, the ensemble of such mappings is a dynamic, time-consistent risk measure.
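
When the stage costs are functions of a Markov state, the nested expression in Theorem II.2 can be evaluated by a backward recursion. The following is a minimal sketch under stated assumptions, not the method of [11]: it assumes a fixed policy (so the state follows a Markov chain with a time-homogeneous transition matrix `P`), a stage cost vector `d`, and zero terminal cost. The function names (`expectation`, `worst_case`, `compound_risk`) and the numbers are hypothetical and chosen only for illustration; expectation and worst case are used as the one-step maps because both are coherent.

```python
from typing import Callable, List, Sequence

# A one-step risk map takes next-stage values and their probabilities.
OneStepRisk = Callable[[Sequence[float], Sequence[float]], float]


def expectation(values: Sequence[float], probs: Sequence[float]) -> float:
    """Risk-neutral one-step map: the conditional expectation."""
    return sum(p * v for p, v in zip(probs, values))


def worst_case(values: Sequence[float], probs: Sequence[float]) -> float:
    """Worst-case one-step map: largest value with positive probability."""
    return max(v for p, v in zip(probs, values) if p > 0.0)


def compound_risk(P: Sequence[Sequence[float]], d: Sequence[float],
                  horizon: int, sigma: OneStepRisk) -> List[float]:
    """Return the compounded risk R_0(x) for every initial state x.

    Uses R_N = 0 and R_k(x) = d(x) + sigma(R_{k+1}, P[x]), which is the
    nested formula (1) written as a backward recursion.
    """
    risk = [0.0] * len(d)          # zero terminal cost
    for _ in range(horizon):       # one backward step per stage
        risk = [d[x] + sigma(risk, P[x]) for x in range(len(d))]
    return risk


# Illustrative two-state chain (hypothetical numbers).
P = [[0.5, 0.5], [0.0, 1.0]]
d = [1.0, 2.0]
print(compound_risk(P, d, 2, expectation))  # [2.5, 4.0]
print(compound_risk(P, d, 2, worst_case))   # [3.0, 4.0]
```

Any other coherent one-step map with the same signature can be passed as `sigma`; the recursion itself does not change.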


In this paper we consider a (slight) refinement of the concept of dynamic, time-consistent risk metric, which involves the addition of a Markovian structure [11] and enables the development of dynamic programming equations.

Definition II.3 (Markov Dynamic Risk Measures [11]). Let $\mathcal{V} := L_p(S, \mathcal{B}, P)$ be the space of random variables on $S$ with finite $p$th moment. Given a controlled Markov process $\{x_k\}$, a dynamic, time-consistent risk metric is a Markov dynamic risk metric if each coherent one-step conditional risk measure $\rho_k : \mathcal{Z}_{k+1} \to \mathcal{Z}_k$ in (1) can be written as:

$$\rho_k(V(x_{k+1})) = \sigma_k\big(V(x_{k+1}), x_k, Q(x_{k+1} \mid x_k, u_k)\big), \qquad (2)$$

for all $V(x_{k+1}) \in \mathcal{V}$ and $u \in U(x_k)$, where $\sigma_k$ is a coherent one-step conditional risk measure on $\mathcal{V}$ (with the additional technical property that for every $V(x_{k+1}) \in \mathcal{V}$ and $u \in U(x_k)$ the function $x_k \mapsto \sigma_k(V(x_{k+1}), x_k, Q(x_{k+1} \mid x_k, u_k))$ is an element of $\mathcal{V}$).

In other words, in a Markov dynamic risk measure, the evaluation of risk is not allowed to depend on the whole past.


C. Stochastic Optimal Control with Dynamic, Time-consistent Risk Constraints

Consider an MDP and let $c : S \times U \to \mathbb{R}$ and $d : S \times U \to \mathbb{R}$ be functions which denote costs associated with state-action pairs. Given a policy $\pi \in \Pi$, an initial state $x_0 \in S$, and a horizon $N \geq 1$, the multi-stage cost function is defined as

$$J_N^\pi(x_0) := E\Big[\sum_{k=0}^{N-1} c(x_k, u_k)\Big],$$

and the risk constraint is defined as

$$R_N^\pi(x_0) := \rho_{0,N}\big(d(x_0, u_0), \ldots, d(x_{N-1}, u_{N-1}), 0\big),$$

where $\rho_{k,N}(\cdot)$, $k \in \{0, \ldots, N-1\}$, is a Markov dynamic risk metric (for simplicity, we do not consider terminal costs, even though their inclusion is straightforward). The problem is then as follows:

Optimization problem OPT: Given an initial state $x_0 \in S$, a time horizon $N \geq 1$, and a risk threshold $r_0 \in \mathbb{R}$, solve

$$\min_{\pi \in \Pi}\ J_N^\pi(x_0) \quad \text{subject to} \quad R_N^\pi(x_0) \leq r_0.$$

If problem OPT is not feasible, we say that its value is $\infty$. In [1] the authors developed a dynamic programming approach to solve this problem. To define the value functions, one needs to define the tail subproblems. For a given $k \in \{0, \ldots, N-1\}$ and a given state $x_k \in S$, we define the sub-histories as $h_{k,j} := (x_k, u_k, \ldots, x_j)$ for $j \in \{k, \ldots, N\}$; also, we define the space of truncated policies as

$$\Pi_k := \Big\{ \{\pi_k, \pi_{k+1}, \ldots\} \ \Big|\ \pi_j(h_{k,j}) \in U(x_j) \text{ for } j \geq k \Big\}.$$

For a given stage $k$ and state $x_k$, the cost of the tail process associated with a policy $\pi \in \Pi_k$ is simply

$$J_N^\pi(x_k) := E\Big[\sum_{j=k}^{N-1} c(x_j, u_j)\Big].$$

The risk associated with the tail process is:

$$R_N^\pi(x_k) := \rho_{k,N}\big(d(x_k, u_k), \ldots, d(x_{N-1}, u_{N-1}), 0\big).$$

The tail subproblems are then defined as

$$\min_{\pi \in \Pi_k}\ J_N^\pi(x_k) \qquad (3)$$

$$\text{subject to} \quad R_N^\pi(x_k) \leq r_k(x_k), \qquad (4)$$

for a given (undetermined) threshold value $r_k(x_k) \in \mathbb{R}$ (i.e., the tail subproblems are specified up to a threshold value).

For each $k \in \{0, \ldots, N-1\}$ and $x_k \in S$, we define the set of feasible constraint thresholds as

$$\Phi_k(x_k) := [\underline{R}_N(x_k), \infty), \qquad \Phi_N(x_N) := [0, \infty),$$

where $\underline{R}_N(x_k) := \min_{\pi \in \Pi_k} R_N^\pi(x_k)$. One then defines the value functions as follows:

• If $k < N$ and $r_k \in \Phi_k(x_k)$:

$$V_k(x_k, r_k) = \min_{\pi \in \Pi_k}\ J_N^\pi(x_k) \quad \text{subject to} \quad R_N^\pi(x_k) \leq r_k.$$

• If $k < N$ and $r_k \notin \Phi_k(x_k)$:

$$V_k(x_k, r_k) = \infty.$$

• When $k = N$ and $r_N \in \Phi_N(x_N) = [0, \infty)$:

$$V_N(x_N, r_N) = 0.$$

Let $B(S)$ denote the space of real-valued bounded functions on $S$, and $B(S \times \mathbb{R})$ denote the space of real-valued bounded functions on $S \times \mathbb{R}$. For $k \in \{0, \ldots, N-1\}$, we define the dynamic programming operator $T_k[V_{k+1}] : B(S \times \mathbb{R}) \to B(S \times \mathbb{R})$ according to the equation:

$$T_k[V_{k+1}](x_k, r_k) := \inf_{(u, r') \in F_k(x_k, r_k)} \Big\{ c(x_k, u) + \sum_{x_{k+1} \in S} Q(x_{k+1} \mid x_k, u)\, V_{k+1}(x_{k+1}, r'(x_{k+1})) \Big\}, \qquad (5)$$

where $F_k$ is the set of control/threshold functions:

$$F_k(x_k, r_k) := \Big\{ (u, r') \ \Big|\ u \in U(x_k),\ r'(x') \in \Phi_{k+1}(x') \text{ for all } x' \in S, \text{ and } d(x_k, u) + \rho_k(r'(x_{k+1})) \leq r_k \Big\}.$$

If $F_k(x_k, r_k) = \emptyset$, then $T_k[V_{k+1}](x_k, r_k) = \infty$.

For a given state and threshold constraint, $F_k$ characterizes the set of feasible pairs of actions and subsequent constraint thresholds. Feasible subsequent constraint thresholds are thresholds which, if satisfied at the next stage, ensure that the current state satisfies the given constraint threshold. Note that the value functions are defined on an augmented state space, which combines the original (discrete) states $x_k$ with the real-valued risk-to-go states $r_k$. We will refer to the MDP problem associated with such augmented state space as augmented MDP (AMDP). The main result in [1] is the following theorem about the correctness of value iteration for AMDP.

Theorem II.4 (Bellman’s Equation with Risk Constraints [1]). For all $k \in \{0, \ldots, N-1\}$ the value functions satisfy the Bellman’s equation:

$$V_k(x_k, r_k) = T_k[V_{k+1}](x_k, r_k).$$

Next, we present a procedure to construct optimal policies. Under the assumptions of Theorem II.4, for any given $x_k \in S$ and $r_k \in \Phi_k(x_k)$ (which implies that $F_k(x_k, r_k)$ is non-empty), let $u^*(x_k, r_k)$ and $r'(x_k, r_k)(\cdot)$ be the minimizers in equation (5). The next theorem shows how to construct history dependent optimal policies.

Theorem II.5 (Optimal Policies). Let $\pi \in \Pi$ be a policy recursively defined as:

$$\pi_k(h_{0,k}) = u^*(x_k, r_k), \quad \text{with } r_k = r'(x_{k-1}, r_{k-1})(x_k),$$

when $k \in \{1, \ldots, N-1\}$, and

$$\pi_0(x_0) = u^*(x_0, r_0),$$

for a given threshold $r_0 \in \Phi_0(x_0)$. Then, $\pi$ is an optimal policy for problem OPT with initial condition $x_0$ and constraint threshold $r_0$.

Interestingly, if one views the constraint thresholds as state variables (whose dynamics are given in the statement of Theorem II.5), the optimal (history-dependent) policies of problem OPT have a Markovian structure with respect to the augmented control problem.


III. Time consistency and conceptual risk-to-go 

We start this section by demonstrating some time-inconsistent behaviors in risk-sensitive optimal control problems using several counter-examples.


A. Time Inconsistent Planning Leads to Irrational Behaviors 
The most common strategy to model risk awareness in 
MDPs is to consider a static risk metric (i.e., a metric 
assessing risk from the perspective of a single point in 
time) applied to the entire stream of future costs. Typical 
examples include variance-constrained MDPs [17], [18], 
[19], or problems with probability constraints [17], [20], 
[21], [22], which are popular in the robotics community (in 
these problems risk is assessed only from the perspective of 
the initial stage). However, since static risk metrics do not 
involve a reassessment of risk at subsequent decision stages 
they generally lead to irrational behaviors. For example, a 
UV can seek to incur losses (i.e., dangerous maneuvers) or 
can deem as dangerous states that are indeed favorable under 
any realization of the underlying uncertainty. 

In this subsection, we illustrate some irregular behaviors in risk-sensitive multi-period planning with two examples.

Example 1: Variance-constrained planning. Given an MDP with initial state $x_0 \in S$ and time horizon $N \geq 1$, solve

$$\min_\pi\ E\Big[\sum_{k=0}^{N-1} c(x_k, u_k) + c_N(x_N)\Big] \quad \text{subject to} \quad \mathrm{Var}\Big(\sum_{k=0}^{N-1} d(x_k, u_k) + d_N(x_N)\Big) \leq r_0,$$

where $r_0 \in \mathbb{R}$ is a user-provided risk threshold.

Consider the example in Figure 1. When the risk threshold $r_0$ is below 25, policy $\pi_1$ is infeasible and the optimal policy is $\pi_2$. According to policy $\pi_2$, if the decision maker does not incur a cost in the first stage, it seeks to incur losses in subsequent stages to keep the variance small. This can be seen as a consequence of the fact that Bellman’s principle of optimality does not hold for this class of problems.



(a) Stage-wise constraint and objective function costs and transition probabilities for policy $\pi_1$.

(b) Stage-wise constraint and objective function costs and transition probabilities for policy $\pi_2$.

Fig. 1. Limitations of mean-variance optimization. Underlined numbers along the edges represent transition probabilities; non-underlined numbers represent stage-wise constraint and objective function costs (that are equal for this example). Terminal constraint costs are zero. Under policy $\pi_1$, the costs per stage are given by $d(s_0, u_0) = 0.5 \cdot 0 + 0.5 \cdot 10 = 5$, $d(s_1, u_1) = 10$, and $d(s_2, u_1) = 10$; under policy $\pi_2$, the costs per stage are given by $d(s_0, u_0) = 5$, $d(s_1, u_1) = 20$, and $d(s_2, u_1) = 10$. One can verify that for policy $\pi_1$ one has $\mathrm{Var}\big(\sum_{k=0}^{N-1} d(x_k, u_k) + d(x_N)\big) = 25$, while for policy $\pi_2$ one has $\mathrm{Var}\big(\sum_{k=0}^{N-1} d(x_k, u_k) + d(x_N)\big) = 0$. Then, if the risk threshold is less than 25, the decision-maker would choose policy $\pi_2$ and would seek to incur losses in order to keep the variance small enough.
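
The two variances quoted in the caption of Fig. 1 can be reproduced with a few lines of arithmetic. The sketch below rests on one assumption that is not visible in the extracted figure: the first-stage cost 0 leads to state $s_1$ and the first-stage cost 10 leads to state $s_2$, each with probability 0.5. This branching is the one consistent with the stated variances of 25 and 0. The helper names are hypothetical.

```python
def mean(outcomes):
    """Mean of a finite distribution given as (probability, value) pairs."""
    return sum(p * v for p, v in outcomes)


def variance(outcomes):
    """Variance of a finite distribution given as (probability, value) pairs."""
    m = mean(outcomes)
    return sum(p * (v - m) ** 2 for p, v in outcomes)


# Total cost along each branch: first-stage cost + second-stage cost.
# Assumed branching: cost 0 -> s1, cost 10 -> s2, each with probability 0.5.
PI_1 = [(0.5, 0 + 10), (0.5, 10 + 10)]  # d(s1,u1) = 10, d(s2,u1) = 10
PI_2 = [(0.5, 0 + 20), (0.5, 10 + 10)]  # d(s1,u1) = 20, d(s2,u1) = 10

print(mean(PI_1), variance(PI_1))  # 15.0 25.0
print(mean(PI_2), variance(PI_2))  # 20.0 0.0
```

Under this reading, $\pi_2$ equalizes the total cost across branches by paying 20 after a cost-free first stage, which removes the variance at the price of a higher expected cost (20 instead of 15).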


As a second example, we consider MDPs with average value at risk (AVaR) constraints, which are closely related to chance (i.e., probability) constraints and are enjoying growing popularity, especially in the finance industry [23], due to favorable computational aspects such as convexity. The average value at risk of a random variable $X$ at confidence level $\alpha$ is defined as [24]:

$$\mathrm{AVaR}_\alpha(X) := \frac{1}{1 - \alpha} \int_\alpha^1 \mathrm{VaR}_\tau(X)\, d\tau,$$

where $\mathrm{VaR}_\alpha(X)$ is simply the $\alpha$-quantile of the random variable $X$, i.e.,

$$\mathrm{VaR}_\alpha(X) = \min\{x \mid P[X \leq x] \geq \alpha\}.$$

Intuitively, $\mathrm{AVaR}_\alpha$ is the expectation of $X$ in the conditional distribution of its upper $(1-\alpha)$-tail. For this reason, it can be interpreted as a metric of “how bad is bad.” The risk metric $\mathrm{AVaR}_\alpha$ is closely related to chance constraints, since the constraint $\mathrm{VaR}_\alpha(X) \leq 0$ corresponds to the chance constraint $P[X \leq 0] \geq \alpha$ [24].
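
For a random variable with finitely many outcomes, the quantile function is piecewise constant, so the integral in the AVaR definition reduces to a finite sum. The following sketch illustrates this under that finite-support assumption; the function names and the sample distributions are hypothetical and are not taken from [24].

```python
def value_at_risk(outcomes, alpha):
    """alpha-quantile min{x : P[X <= x] >= alpha} of a finite distribution.

    `outcomes` is a list of (probability, value) pairs; 0 < alpha <= 1.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    cumulative = 0.0
    for p, v in sorted(outcomes, key=lambda pv: pv[1]):
        cumulative += p
        if cumulative >= alpha - 1e-12:  # tolerance for rounding
            return v
    return max(v for _, v in outcomes)   # only reached through rounding


def average_value_at_risk(outcomes, alpha):
    """(1 / (1 - alpha)) * integral of VaR_tau over tau in [alpha, 1]."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    total, lower = 0.0, 0.0
    for p, v in sorted(outcomes, key=lambda pv: pv[1]):
        upper = lower + p
        # VaR_tau equals v on (lower, upper]; keep the part inside [alpha, 1].
        overlap = max(0.0, min(upper, 1.0) - max(lower, alpha))
        total += v * overlap
        lower = upper
    return total / (1.0 - alpha)


uniform = [(0.25, v) for v in (1.0, 2.0, 3.0, 4.0)]
print(value_at_risk(uniform, 0.5))          # 2.0
print(average_value_at_risk(uniform, 0.5))  # 3.5
print(average_value_at_risk(uniform, 0.0))  # 2.5
```

At $\alpha = 0$ the average value at risk coincides with the expectation, and it increases towards the worst outcome as $\alpha$ approaches 1.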

Example 2: AVaR-constrained planning. Given an MDP with initial state $x_0 \in S$ and time horizon $N \geq 1$, solve

$$\min_\pi\ E\Big[\sum_{k=0}^{N-1} c(x_k, u_k) + c_N(x_N)\Big] \quad \text{subject to} \quad \mathrm{AVaR}_\alpha\Big(\sum_{k=0}^{N-1} d(x_k, u_k) + d_N(x_N)\Big) \leq r_0,$$

where $r_0 \in \mathbb{R}$ is a user-provided risk threshold.

Let us interpret the constraint costs $d$ as acceptable if negative and unacceptable otherwise. Accordingly, consider the example in Figure 2 (based upon [25]), with threshold $r_0 = 0$ and confidence level 1/3. One can show that the problem (consisting of a single policy) is infeasible, since at the first stage AVaR is positive. On the other hand, the constraint costs are acceptable in every state of the world from the perspective of the subsequent stage. In other words, the decision-maker would deem infeasible a problem that, at the second stage, appears feasible under any possible realization of the uncertainties.


[Figure 2: transition graph; label at the root node: $\mathrm{AVaR}_{1/3} = 1$.]

Fig. 2. Limitations of AVaR-constrained optimization. The numbers along the edges represent transition probabilities, while the numbers below the terminal nodes represent the terminal constraint costs (the other constraint costs are zero). The problem involves a single control policy (hence there is a unique transition graph). The constraint cost appears acceptable in states $s_1$ and $s_2$, but unacceptable from the perspective of the first stage in state $s_0$.


It is important to note that there is nothing special about these examples, which indeed capture a range of widely accepted criteria to pose risk constraints. Similar paradoxical results could be obtained with other risk metrics. Henceforth, we will collectively refer to the aforementioned irrational behaviors as “time-inconsistent” policies, since they reflect an inconsistent risk assessment over time.

B. Time Consistency 

From Theorem II.4 and Theorem II.5 we can find a sequence of history dependent optimal control policies by Bellman iteration. In this section, we want to show that the risk constrained SOC in Problem OPT is time consistent. Most of the literature (cf. Chapter 1 of [26], [27] for more details) restricts the analysis of time consistency to problems with Markovian policies. Since Problem OPT solves for a sequence of history dependent policies, it is unclear how we can analyze time consistency directly by truncating the sub-histories. For this purpose, we define the space of augmented state feedback policies and the space of risk-to-go updates:

$$\Pi_k^m := \Big\{ \{\pi_k, \ldots, \pi_{N-1}\} : \pi_j(x_j, r_j) \in U(x_j),\ j \in \{k, \ldots, N-1\} \Big\},$$

$$\mathcal{R}_k := \Big\{ \{\mathcal{R}_{k+1}, \ldots, \mathcal{R}_N\} : \mathcal{R}_{j+1}(x_j, r_j)(x_{j+1}) = r_{j+1} - r_j,\ r_{j+1} \in \Phi_{j+1}(x_{j+1}),\ j \in \{k, \ldots, N-1\} \Big\},$$

and the following optimization problem:

Optimization problem OPT$^m$: Given an initial state $x_0 \in S$, a time horizon $N \geq 1$, and a risk threshold $r_0 \in \mathbb{R}$, solve

$$\min_{\pi \in \Pi_0^m,\ \mathcal{R} \in \mathcal{R}_0}\ J_N^\pi(x_0)$$

subject to

$$\rho_j\big(r_{j+1} + d_j(x_j, \pi_j(x_j, r_j))\big) = r_j,$$
$$r_{j+1} = \mathcal{R}_{j+1}(x_j, r_j)(x_{j+1}) + r_j,$$
$$0 \leq r_N, \quad j \in \{0, \ldots, N-1\}.$$

The $k$-th subproblem of Problem OPT$^m$ is simply defined by replacing $x_0 \in S$ and $r_0 \in \Phi_0(x_0)$ by $x_k \in S$ and $r_k \in \Phi_k(x_k)$, i.e.,

• If $k < N$ and $r_k \in \Phi_k(x_k)$:

$$V_k^m(x_k, r_k) = \min_{\pi \in \Pi_k^m,\ \mathcal{R} \in \mathcal{R}_k}\ J_N^\pi(x_k)$$

subject to

$$\rho_j\big(r_{j+1} + d_j(x_j, \pi_j(x_j, r_j))\big) = r_j,$$
$$r_{j+1} = \mathcal{R}_{j+1}(x_j, r_j)(x_{j+1}) + r_j,$$
$$0 \leq r_N, \quad j \in \{k, \ldots, N-1\}.$$

• If $k < N$ and $r_k \notin \Phi_k(x_k)$:

$$V_k^m(x_k, r_k) = \infty.$$

• When $k = N$ and $r_N \in \Phi_N(x_N) = [0, \infty)$:

$$V_N^m(x_N, r_N) = 0.$$

We are now in the position of defining the notion of time consistency for Problem OPT$^m$.

Definition III.1 (Time Consistency of Risk Constrained SOC). For any given initial state $x_0 \in S$ and risk threshold $r_0 \in \Phi_0(x_0)$, define $\pi^* = \{\pi_0^*, \ldots, \pi_{N-1}^*\} \in \Pi_0^m$ as a sequence of optimal policies and $\mathcal{R}^* = \{\mathcal{R}_1^*, \ldots, \mathcal{R}_N^*\}$ as a sequence of risk-to-go updates for Problem OPT$^m$. Problem OPT$^m$ is a time consistent SOC problem if, at any $k \in \{0, \ldots, N-1\}$, the $k$-subsequences of $\pi^*$, $\pi_{k,N}^* := \{\pi_k^*, \ldots, \pi_{N-1}^*\}$, and of $\mathcal{R}^*$, $\mathcal{R}_{k+1,N}^* := \{\mathcal{R}_{k+1}^*, \ldots, \mathcal{R}_N^*\}$, are sequences of optimal solutions to the $k$-tail subproblem of Problem OPT$^m$ at initial state $x_k \in S$ and risk threshold $r_k \in \Phi_k(x_k)$. In this case, $\pi^*$ is a time consistent optimal policy, and $\mathcal{R}^*$ is a time consistent risk-to-go.

Before getting into the main result, we want to justify the equivalence between Problem OPT$^m$ and Problem OPT. First, we have the following technical lemma, showing that, without loss of optimality, the inequality constraint in $F_k(x_k, r_k)$ can be replaced by an equality.

Lemma III.2. For any value function $V_{k+1} : S \times \Phi_{k+1}(S) \to \mathbb{R}$, $\forall k \in \{0, \ldots, N-1\}$ for Problem OPT, the following equality holds:

$$V_k(x_k, r_k) = T_k[V_{k+1}](x_k, r_k) = \tilde{T}_k[V_{k+1}](x_k, r_k), \quad \forall x_k, r_k, \qquad (6)$$

where

$$\tilde{T}_k[V](x_k, r_k) := \inf_{(u, r') \in \tilde{F}_k(x_k, r_k)} \Big\{ c(x_k, u) + \sum_{x_{k+1} \in S} Q(x_{k+1} \mid x_k, u)\, V(x_{k+1}, r'(x_{k+1})) \Big\},$$

and $\tilde{F}_k$ is the set of control/threshold functions:

$$\tilde{F}_k(x_k, r_k) := \Big\{ (u, r') \ \Big|\ u \in U(x_k),\ r'(x') \in \Phi_{k+1}(x') \text{ for all } x' \in S,\ r'(x') = r_k + L(x'), \text{ and } \rho_k(L) + d(x_k, u) = 0 \Big\}.$$

Proof. First, it is obvious that $\tilde{F}_k(x_k, r_k) \subseteq F_k(x_k, r_k)$ and $T_k[V_{k+1}](x_k, r_k) \leq \tilde{T}_k[V_{k+1}](x_k, r_k)$. Now, suppose $F_k(x_k, r_k) \neq \emptyset$ and there exists a risk-to-go $r'^{,*}(x_{k+1})$ such that

$$d(x_k, u^*) + \rho_k(r'^{,*}(x_{k+1})) < r_k,$$

where $u^*$ is the optimal control input solved from the Bellman’s equation in (5). Recall the definition of the value function of Problem OPT, $V_{k+1}(x_{k+1}, r_{k+1}) = \min\{J_N^\pi(x_{k+1}) : \pi \in \Pi_{k+1},\ R_N^\pi(x_{k+1}) \leq r_{k+1}\}$ when $r_{k+1} \in \Phi_{k+1}(x_{k+1})$. It can easily be seen that $V_{k+1}(x_{k+1}, r_{k+1})$ is a non-increasing function with respect to $r_{k+1}$. Furthermore, since $\rho_k$ is Lipschitz and $r'^{,*}(x_{k+1})$ is a discrete state, continuous magnitude random variable, we can always find a nonnegative, discrete state, continuous magnitude, bounded random variable $\epsilon(x_{k+1})$ such that

$$d(x_k, u^*) + \rho_k\big(r'^{,*}(x_{k+1}) + \epsilon(x_{k+1})\big) = r_k.$$

Furthermore, we know that $r'^{,*}(x_{k+1}) + \epsilon(x_{k+1}) \geq r'^{,*}(x_{k+1})$ surely, which implies $r'^{,*}(x_{k+1}) + \epsilon(x_{k+1}) \geq \underline{R}_N(x_{k+1})$. On the other hand, since $r_k$, $\epsilon(x_{k+1})$ and $r'^{,*}(x_{k+1})$ are all bounded quantities, this implies $r'^{,*}(x_{k+1}) + \epsilon(x_{k+1}) < \infty$ surely. By writing $\tilde{r}'^{,*}(x_{k+1}) = r'^{,*}(x_{k+1}) + \epsilon(x_{k+1}) \geq r'^{,*}(x_{k+1})$, one obtains $V_{k+1}(x_{k+1}, r'^{,*}(x_{k+1})) \geq V_{k+1}(x_{k+1}, \tilde{r}'^{,*}(x_{k+1}))$ surely. Thus, we can use $(u^*, \tilde{r}'^{,*}) \in F_k(x_k, r_k)$ as a minimizer for the Bellman’s equation in (5). To summarize, whenever $F_k(x_k, r_k) \neq \emptyset$, there always exists an optimal control input $u^*$ and risk-to-go $r'^{,*}(x_{k+1})$ for the Bellman’s equation in (5) such that $d(x_k, u^*) + \rho_k(r'^{,*}(x_{k+1})) = r_k$. By letting $L(x') = r'^{,*}(x') - r_k$, $\forall x' \in S$, we have just shown that there is no loss of optimality in considering the operator $\tilde{T}_k[V_{k+1}](x_k, r_k)$ instead of the Bellman’s equation in (5). □

Essentially, we have the following result. The proof of this theorem is analogous to the proof of Theorem II.4 and is omitted here in the interest of brevity. Details of the proof can be found in the Appendix.

Theorem III.3. The value function of Problem OPT$^m$ is identical to the value function of Problem OPT. Furthermore, Problem OPT$^m$ is a time consistent SOC problem.



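Assume the infimum in expression (6) is attained. Let $r_0^* = r_0$. Based on Definition III.1, any optimal control policies and optimal risk-to-go found from the sequence of Bellman’s equations, $\forall k \in \{0, \ldots, N-1\}$:

$$\big(\pi_k^*(x_k, r_k^*),\ r_{k+1}^*\big) \in \operatorname*{argmin}_{(u, r') \in \tilde{F}_k(x_k, r_k^*)} \Big\{ c(x_k, u) + \sum_{x_{k+1} \in S} Q(x_{k+1} \mid x_k, u)\, V_{k+1}^m(x_{k+1}, r'(x_{k+1})) \Big\}$$

are time consistent.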

Remark III.4. Notice that the risk-to-go satisfies the following equation:

$$\rho_k(r_{k+1}^*) + d_k(x_k, \pi_k^*(x_k, r_k^*)) = r_k^*, \quad \forall k \in \{0, \ldots, N-2\},$$

where $r_0^* = r_0$. Define

$$M_{k+1} = r_{k+1}^* + \sum_{j=0}^{k} d_j(x_j, \pi_j^*(x_j, r_j^*)).$$

We can show that $M_{k+1}$ satisfies the following risk-sensitive martingale property: $\rho_k(M_{k+1}) = M_k$.
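
The property in Remark III.4 follows in two steps from translation invariance and the risk-to-go equation. The derivation below is a sketch that assumes the constraint costs incurred up to stage $k$ are measurable at stage $k$ (so they can be pulled out of $\rho_k$) and uses the convention $M_0 := r_0^*$ for the empty sum; both are assumptions made here for the illustration.

```latex
\begin{align*}
\rho_k(M_{k+1})
  &= \rho_k\Big( r_{k+1}^* + \sum_{j=0}^{k} d_j\big(x_j, \pi_j^*(x_j, r_j^*)\big) \Big) \\
  &= \rho_k(r_{k+1}^*) + \sum_{j=0}^{k} d_j\big(x_j, \pi_j^*(x_j, r_j^*)\big)
     && \text{(translation invariance)} \\
  &= r_k^* - d_k\big(x_k, \pi_k^*(x_k, r_k^*)\big)
     + \sum_{j=0}^{k} d_j\big(x_j, \pi_j^*(x_j, r_j^*)\big)
     && \text{(risk-to-go equation)} \\
  &= r_k^* + \sum_{j=0}^{k-1} d_j\big(x_j, \pi_j^*(x_j, r_j^*)\big) = M_k .
\end{align*}
```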

Next, we have the following corollary describing the closed form solution policy of Problem OPT. The proof is identical to the proof of Theorem II.5 and will be omitted for brevity.

Corollary III.5. Let $\pi \in \Pi$ be a policy recursively defined by the solution of the Bellman’s equation in (6):

$$\pi_k(h_{0,k}) = u^*(x_k, r_k^*), \quad \text{with } r_k^* = r_{k-1}^* + L^*(x_{k-1}, r_{k-1}^*)(x_k),$$

when $k \in \{1, \ldots, N-1\}$, and

$$\pi_0(x_0) = u^*(x_0, r_0),$$

where $r_0^* = r_0 \in \Phi_0(x_0)$. Then, $\pi = \{\pi_0, \ldots, \pi_{N-1}\}$ is an optimal policy for Problem OPT with initial state $x_0$ and threshold $r_0$. Furthermore, the $k$-th subsequence of $\pi$, i.e., $\{\pi_k, \ldots, \pi_{N-1}\}$, is also an optimal history dependent policy for the tail subproblem of Problem OPT.

Similar to Problem OPT, this corollary concludes that the optimal policy of Problem OPT$^m$ can also be constructed using both the state update and the risk-to-go. The resulting policy is thus history dependent. However, it is still unclear how one can obtain an analytical formula for the risk-to-go merely from solving the Bellman iteration in (6).

C. Analytical Formula for the Risk-to-go Update 

Based on the time consistency analysis in previous sections, we aim to derive the analytical update formula for the risk-to-go $r_k$. Before getting to the main result, we need the following technical lemmas.

Lemma III.6. Suppose $r_k = R_N^\pi(x_k) - R_N^\pi(x_{k-1}) + r_{k-1}$ for $k \in \{1, \ldots, N-1\}$ and $\pi \in \Pi$ is any admissible history dependent control policy. Then, if $r_0 \in \Phi_0(x_0)$ is such that $R_N^\pi(x_0) \leq r_0$, it follows that $r_k \in \Phi_k(x_k)$ for all $k \in \{0, \ldots, N-1\}$.

Proof. First, we characterize the lower bound of $r_k$. By definition, $r_{j+1} - r_j = R_N^\pi(x_{j+1}) - R_N^\pi(x_j)$, for $j \in \{0, \ldots, N-2\}$. By summing over $j = 0$ to $k-1$, for any fixed $k \in \{0, \ldots, N-1\}$, we get

$$r_k - r_0 = R_N^\pi(x_k) - R_N^\pi(x_0) \implies r_k - R_N^\pi(x_k) = r_0 - R_N^\pi(x_0).$$

Thus, one obtains

$$r_k - \underline{R}_N(x_k) \geq r_k - R_N^\pi(x_k) = r_0 - R_N^\pi(x_0) \geq 0.$$

The last inequality is due to the fact that $\pi \in \Pi$ is a feasible control policy. Next, we characterize the upper bound for $r_k$. For $r_k = R_N^\pi(x_k) - R_N^\pi(x_{k-1}) + r_{k-1}$, by monotonicity and translation invariance of multi-period risk measures, one can easily show that for any $k \in \{0, \ldots, N-1\}$,

$$R_N^\pi(x_k) \leq (N - k)\rho_{\max}, \qquad R_N^\pi(x_k) \geq (N - k)\rho_{\min}.$$

Therefore, the above expressions imply

$$r_k - r_{k-1} \leq (N - k)(\rho_{\max} - \rho_{\min}) - \rho_{\min}.$$

By a telescopic sum, and since $r_0 \in [\underline{R}_N(x_0), \infty)$, one obtains

$$r_k \leq \sum_{j=1}^{k} (N - j)(\rho_{\max} - \rho_{\min}) - k\rho_{\min} + r_0 = r_0 + k\Big[\Big(N - \frac{k+1}{2}\Big)(\rho_{\max} - \rho_{\min}) - \rho_{\min}\Big] < \infty.$$

Thus, combining this with the lower bound for $r_k$, we get $r_k \in [\underline{R}_N(x_k), \infty)$, which means $r_k \in \Phi_k(x_k)$ for $k \in \{0, \ldots, N-1\}$. □


We are now in the position of deriving the main result of this paper. The following theorem provides an analytical update formula for the risk-to-go $r_k$.

Theorem III.7 (Conceptual Risk-to-go). Let $\pi^* \in \Pi$ be an optimal policy for Problem OPT. The following risk-to-go

$$\tilde{r}_0 = r_0, \qquad \tilde{r}_{k+1} = \tilde{r}_k + R_N^{\pi^*}(x_{k+1}) - R_N^{\pi^*}(x_k), \quad \forall k \in \{0, \ldots, N-1\}, \qquad (7)$$

and the augmented state-feedback control policy $\tilde{\pi} = \{\tilde{\pi}_0, \ldots, \tilde{\pi}_{N-1}\}$, where $\tilde{\pi}_k(x_k, \tilde{r}_k) = \pi_k^*(x_k)$, form a time consistent solution to Problem OPT$^m$.

Proof. Since $R_N^{\pi^*}(x_k)$ and $R_N^{\pi^*}(x_{k+1})$ are formed by compounding Markov risk measures, we can easily see from Definition II.3 that $R_N^{\pi^*}(x_k)$ and $R_N^{\pi^*}(x_{k+1})$ are functions of $x_k$ and $x_{k+1}$. Also, define

$$L_k(x_{k+1}) = R_N^{\pi^*}(x_{k+1}) - R_N^{\pi^*}(x_k)$$

for any $k \in \{0, \ldots, N-1\}$. From Lemma III.6 one obtains $\tilde{r}_{k+1} = \tilde{r}_k + L_k(x_{k+1}) \in \Phi_{k+1}(x_{k+1})$, $\forall k \in \{0, \ldots, N-2\}$. Also, as there is no terminal cost, by a telescopic sum,

$$\tilde{r}_N = \tilde{r}_0 + \sum_{k=0}^{N-1} L_k(x_{k+1}) = -R_N^{\pi^*}(x_0) + r_0.$$

As $\pi^*$ is an optimal history dependent control policy for Problem OPT, one obtains $R_N^{\pi^*}(x_0) \leq \tilde{r}_0$ and $\tilde{r}_N \geq 0$. Furthermore, by the time consistency property and translation invariance of risk measures, and since $R_N^{\pi^*}$ is built from the constraint costs $d$, we have that

$$\rho_k\big(R_N^{\pi^*}(x_{k+1})\big) + d(x_k, \pi_k^*(x_k)) = R_N^{\pi^*}(x_k), \quad \forall k,$$

$$\rho_k\big(R_N^{\pi^*}(x_{k+1})\big) + d(x_k, \tilde{\pi}_k(x_k, \tilde{r}_k)) = R_N^{\pi^*}(x_k), \quad \forall k.$$

Thus, the policy $\tilde{\pi}$ and the risk-to-go sequence $\tilde{r} = \{\tilde{r}_1, \ldots, \tilde{r}_N\}$ are feasible for Problem OPT$^m$. Then,

$$V_0(x_0, r_0) = J_N^{\pi^*}(x_0) = J_N^{\tilde{\pi}}(x_0) \geq V_0^m(x_0, r_0).$$

At the same time, equation (6) implies that $V_0^m(x_0, r_0) \geq V_0(x_0, r_0)$. Thus both arguments imply that $\tilde{\pi}$ and $\tilde{r} = \{\tilde{r}_1, \ldots, \tilde{r}_N\}$ form a solution to Problem OPT$^m$. Time consistency then follows directly from Definition III.1. □


From this theorem, we conclude that the analytical formula 
of the risk-to-go can be written as a Martingale difference 
of the constraint cost function. This property is crucial to 
understand how risk evaluation is updated at each step in 
order to make time consistent decisions and to derive large 
scale risk constrained decision making algorithms. 


IV. The Squander or Save Example (c.e. [2]) 

We consider the simple case of a risk constrained SOC problem in which risk-neutral costs and risk-neutral constraints are considered. Given an MDP with initial state $x_0 \in S_0$, risk threshold $r_0 \in \Phi_0(x_0)$ and time horizon $N \geq 1$, solve

$$\min_\pi\ E\Big[\sum_{k=0}^{N-1} c(x_k, u_k)\Big] \quad \text{subject to} \quad E\Big[\sum_{k=0}^{N-1} d(x_k, u_k)\Big] \leq r_0.$$


Consider the example in the accompanying figures, with $N = 2$ and all terminal costs equal to zero (they are not drawn in the figures). Also, let $S_0$ be the initial state space,

$S_1$ = {“win a lottery”, “lose a lottery”},

$U$(“win a lottery”) = {“squander”, “save”},
$U$(“lose a lottery”) = {“squander”, “save”}
be the state and closed loop action space at stage 1, and

choose to squander (action 1) or to save (action 2). If one 
loses the lottery, at stage 1 one can choose to squander 
(action 1) or to save (action 2). In this example, the stage- 
wise cost represents a level of satisfaction in spending and 
it is inversely proportional to the money spent. On the 
other hand, the stage-wise constraint cost is the probability 
of going bankruptcy, which is directly proportional to the 
money spent. 

Consider e = 0.1 and the risk threshold tq = 0.3. That 
is, the probability of winning lottery is 0.1 and one wants to 
limit the probability of bankruptcy to be under 0.3. Similar 
to Haviv’s argument in [2], if we keep the risk threshold 
constant at 0.3 for all subsequent stages, the optimal policy 
decided at stage 0 is not to squander if one loses the lottery, 
and to squander if one wins the lottery. However, the optimal 
policy decided in stage 1 is to save if one loses or wins the 
lottery. This implies that the optimal control policies decided 
at stage 0 is time-inconsistent. 

On the other hand, suppose one finds the optimal control 
policies by solving the Bellman’s equation in Q. Then, from 
the value function for Problem OVT, one obtains r 2 = 0 
and V{x 2 ,r 2 ) = 0. Now let sn = ‘win a lottery” and Si 2 = 
‘lose a lottery”. At stage 1, the value function is as follows: 


Vi(sii,ri(sii)) 


Vi(si2,'ri(si2)) 


—50 if ri(sii) > 1 
-30 if 0.05 < ri(sii) < 1 
oo otherwise 

—20 if ri(si2) > 0.4 
-10 if 0.2 < ri(si2) < 0.4 
oo otherwise 


and the optimal policy is as follows: 


7r{(sii,ri(sii)) 


7r{(si2,ri(si2)) 


“squander” if ri(sii) > 1 

“save” if 0.05 < ri(sii) < 1 

0 otherwise 


“squander” if ri(si 2 ) > 0.4 

“save” if 0.2 < ri(si 2 ) < 0.4 

0 otherwise 


{ “win a lottery and squander”, 
“win a lottery and save”, 
“lose a lottery and squander”, 
“lose a lottery and save” 


be the state space at stage 2. Suppose there are no actions 
(U{So) = {0},), no cost (c{xo,uo) = 0, V(xo,wo) G Sq x 
U{So)) and no constraint cost {d{xo,uo) = 0, V(a:o,uo) G 
S'o X U{So)) for winning/losing a lottery at stage 0. On the 
other hand, the stage-wise cost in stage 1 is as follows: 


c(“win a lottery”, ui) = 


c(“lose a lottery”,Ml) = 


—50 if ui=“squander” 
—30 if Mi=“save” 

—20 if Mi=“squander” 
— 10 if Mi=“save” 


and the constraint stage-wise cost is as follows: 


/(“win a lottery”,Ml) = 


/(“lose a lottery”. Ml) = 


1 if Mi=“squander” 
0.05 if Mi=“save” 

0.4 if Mi=“squander” 
0.2 if Mi=“save” 


Suppose at time 0, one has probability e of winning 
a lottery and probability 1 — e of losing a lottery where 
0 < e << 1. If one wins the lottery, at stage 1, one can 
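The time inconsistency under a constant threshold can be checked by brute force. The sketch below is only an illustration: it uses the stage-1 costs and constraint costs of the example, and it assumes that re-planning at stage 1 applies the same threshold 0.3 to the constraint cost of the realised state. The names `plan_at_stage0` and `replan_at_stage1` are hypothetical helpers, not part of the original formulation.

```python
from itertools import product

P_WIN = 0.1  # epsilon in the example
R0 = 0.3     # risk threshold r_0

# Stage-1 cost c and constraint cost d, copied from the example.
COST = {("win", "squander"): -50, ("win", "save"): -30,
        ("lose", "squander"): -20, ("lose", "save"): -10}
RISK = {("win", "squander"): 1.0, ("win", "save"): 0.05,
        ("lose", "squander"): 0.4, ("lose", "save"): 0.2}
PROB = {"win": P_WIN, "lose": 1 - P_WIN}
ACTIONS = ("squander", "save")
TOL = 1e-12  # guards the comparisons against floating point round-off


def plan_at_stage0(threshold=R0):
    """Cheapest (action if win, action if lose) pair whose expected
    constraint cost, seen from stage 0, stays within the threshold."""
    best = None
    for a_win, a_lose in product(ACTIONS, repeat=2):
        plan = {"win": a_win, "lose": a_lose}
        exp_risk = sum(PROB[s] * RISK[s, plan[s]] for s in plan)
        exp_cost = sum(PROB[s] * COST[s, plan[s]] for s in plan)
        if exp_risk <= threshold + TOL and (best is None or exp_cost < best[0]):
            best = (exp_cost, plan)
    return best[1]


def replan_at_stage1(state, threshold=R0):
    """Cheapest action in the realised state when the threshold is
    kept constant (assumption: it is applied per realised state)."""
    feasible = [a for a in ACTIONS if RISK[state, a] <= threshold + TOL]
    return min(feasible, key=lambda a: COST[state, a])


plan0 = plan_at_stage0()
plan1 = {s: replan_at_stage1(s) for s in ("win", "lose")}
print(plan0)  # {'win': 'squander', 'lose': 'save'}
print(plan1)  # {'win': 'save', 'lose': 'save'}
```

The two plans disagree in the “win a lottery” state, which is the inconsistency described in the text: squandering after a win is acceptable on average at stage 0, but violates the unchanged threshold once the win has been observed.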


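At stage 0, with $r_0 = 0.3$, the value function is:

$$V_0(s_0, r_0) = \min_{r_1(\cdot) \in \mathbb{R}^2} \; 0.1 \times \begin{cases} -50 & \text{if } r_1(s_{11}) \ge 1 \\ -30 & \text{if } 0.05 \le r_1(s_{11}) < 1 \\ \infty & \text{if } r_1(s_{11}) < 0.05 \end{cases} \; + \; 0.9 \times \begin{cases} -20 & \text{if } r_1(s_{12}) \ge 0.4 \\ -10 & \text{if } 0.2 \le r_1(s_{12}) < 0.4 \\ \infty & \text{if } r_1(s_{12}) < 0.2 \end{cases}$$

$$\text{subject to} \quad r_1(s_{11}) \in [0.05, \infty], \quad r_1(s_{12}) \in [0.2, \infty], \quad 0.1 \times r_1(s_{11}) + 0.9 \times r_1(s_{12}) \le 0.3.$$

This implies that the value function is

$$V_0(s_0, r_0) = -10(0.9) - 30(0.1) = -12.$$

Based on the risk-to-go update, we have that

$$r_1^*(s_0, r_0)(s_{11}) = 0.3 + 0.05 - (0.1(0.05) + 0.9(0.2)) = 0.165,$$
$$r_1^*(s_0, r_0)(s_{12}) = 0.3 + 0.2 - (0.1(0.05) + 0.9(0.2)) = 0.315.$$

Thus, the optimal history dependent policy at stage 0 is

$$\pi_0^*(h_0) = \emptyset, \qquad \pi_1^*(h_1) = \begin{cases} \text{“save”} & \text{if } x_1 = \text{“win a lottery”} \\ \text{“save”} & \text{if } x_1 = \text{“lose a lottery”} \end{cases}$$

i.e., this optimal policy is time consistent.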
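The risk-to-go arithmetic above can be reproduced numerically. The following is a minimal sketch, assuming the risk-neutral setting of this example (expectation as the risk measure, zero constraint cost at stage 0). The update is written as $r_1(x_1) = r_0 + R(x_1) - \mathbb{E}[R(x_1)]$, which is read off from the two numerical lines above, where $R(x_1)$ is the constraint cost incurred from $x_1$ onward. The names `stage1`, `risk_to_go` and `OPTIONS` are hypothetical.

```python
P = {"win": 0.1, "lose": 0.9}  # stage-0 transition probabilities
R0 = 0.3                       # risk threshold r_0

# (risk-to-go needed, cost, action), from the stage-1 value function in the text.
# Entries are ordered from the largest required threshold down.
OPTIONS = {
    "win":  [(1.0, -50, "squander"), (0.05, -30, "save")],
    "lose": [(0.4, -20, "squander"), (0.2, -10, "save")],
}


def stage1(state, r1):
    """Return (V_1, action) for risk-to-go r1, or (inf, None) if infeasible."""
    for need, cost, action in OPTIONS[state]:
        if r1 >= need:
            return cost, action
    return float("inf"), None


def risk_to_go(constraint_to_go, r0=R0, probs=P):
    """Threshold plus the deviation of the constraint cost from its mean."""
    mean = sum(probs[s] * constraint_to_go[s] for s in probs)
    return {s: r0 + constraint_to_go[s] - mean for s in probs}


# Constraint cost-to-go of the stage-1 actions selected in the text (save in both states).
r1 = risk_to_go({"win": 0.05, "lose": 0.2})
policy = {s: stage1(s, r1[s])[1] for s in r1}
value = sum(P[s] * stage1(s, r1[s])[0] for s in r1)

print({s: round(v, 3) for s, v in r1.items()})  # {'win': 0.165, 'lose': 0.315}
print(policy)                                   # {'win': 'save', 'lose': 'save'}
print(round(value, 6))                          # -12.0
```

With these thresholds the expected risk-to-go equals $r_0$ exactly (0.1 × 0.165 + 0.9 × 0.315 = 0.3), and looking up each updated threshold in the stage-1 value function returns the same action that was planned at stage 0.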










Fig. 3. Transition probability, cost function and constraint cost function under policy 1 (squander). The red numbers indicate the stage-wise cost. The underlined black numbers indicate the stage-wise constraint cost.



Fig. 4. Transition probability, cost function and constraint cost function under policy 2 (save). The red numbers indicate the stage-wise cost. The underlined black numbers indicate the stage-wise constraint cost.


V. Conclusion 

In this paper we study the time-consistency of risk constrained problems in which the risk metric is time consistent. From the Bellman optimality condition in [1], we establish an analytical “risk-to-go” that results in a time consistent optimal policy. The effectiveness of the analytical solution is demonstrated by solving Haviv’s counter-example [2] in time inconsistent planning. Future work includes extending the above analysis to large scale risk constrained decision making (via approximate dynamic programming) that provides time consistent policies.
Appendix

A. Proof of Theorem III.3

The proof follows from the Bellman’s equation with risk 
constraints. First, we want to prove that 

V^^{xk,rk) = Vk{xk,rk) (8) 


for any $k \in \{0, \ldots, N-1\}$, $x_k \in S$ and $r_k \in \Phi_k(x_k)$. At

k = N — I, hy definition, V^^_i{xjs[-i,rN-i) = 
^rAr_i[V^](xAr_i,r7V-l) = Tat-i [V/v] (a^iV-l, PN-l)- 
Since Fat( a^y, tat ) = V^jxN.rN) = 0 for any tn G 


[0,oo], by Lemma III.2 one obtains V^j^(a;jv-i, pat-i) = 

TN-l[VN]iXN-l,rN-l) = Ti\[-i^N](xN-l,rN-l) = 
Vn- i{xN-i,rN-i)- Thus, equation ® holds for k = N—1. 
By inductive hypothesis, suppose for fc = j + 1, 


^j+iixj+itXj^i) — Vj+i{xj+i,rj^i) 


Then for k — j. Let tt* G be the optimal policy that 
yields the optimal cost {xj^rf) and r* be the sequence 
of optimal risk threshold update functions. By applying the 
law of total expectation, we can write: 


= E Ci{xi,TrUxi,ri)) 

= E [cj{xj,7r*{xj,rj)) + Jf;' (xj+i)]. 


Clearly, tt* is a feasible policy for the j +1* tail subproblem 
of Problem OVT^ with Xjj^i G S and r*j^i G $j+i(a:j+i). 
Collecting the above results, we can write 


> E [cj{xj,Tr*{xj,rj)) + V^^^{xj+i,r*^^)] 

~ 'yj[yj+i]i^jiXj) = Vj{xj,rj). 


The second inequality follows from the fact that 


{r*,..., r’^} found by solving the sequence of Bellman’s re¬ 
cursions in equation ([T0|. Now, consider the k subsequence: 


= 


fe’' 


and the k subsequence: 


iv-iK nf of 

= {ffc+ii ■ ■ • By definition, one obtains, 

d{xj,TT*{xj,rj)) + Pjir*+i) < r^, WN - 1> j > k. 
and the Bellman’s equation implies 
4* (Xk) =Tk[... [Tn- i[V^]]...]{xk,rk) = {xu , r^). 


This implies that if* is a sequence of optimal control policy 
of the k-tail subproblem of Problem OVT^, for any state 
Xk G S, and any risk threshold G (l>k(xk)- 


Pi(r*+i) + dj(xj,7r*(xj,rj))) < r^. 

and the first equality follows from induction’s assumption. 

On the other hand, for given pair {xj,rj), where Vj G 
<l>j(xj), let u*(xj,rj) and r'’*{xj,rj){xj+i) be the mini- 
mizers in Tj[Vj^i]{xj,rj). Construct a policy if G as 
follows: 7rj(xj,rj) = u*(xj,rj) and Tri(xi,ri) = TT*{xi,ri) 
for i > j + 1. Therefore, the policy tt G is a feasible 
policy for the j + 1* tail subproblem of Problem OVT^^ 
with Tj G ^j(xj). Hence, one easily obtains: 

< JNixj) = E [cj{xj,Trj{xj,rj))) + Jfj’(x^+i)] 
=Tj[Vj^i]ixj,rj) = Tj[Vj+i]{xj,rj) = Vj{xj,rj). 

By combining both steps, the claim in equation (8) is proved by induction. Furthermore, by equation (8) and Theorem II.4, the following expressions hold:

Vi^{xk,rk) = Vk{xk,rk) = Tk[Vk+i\{xk,rk) 

= Tk[Vk^+,]{xk,rk). 

By repeatedly applying this Bellman’s equation, one obtains 


4^(a:o,ro) = To[Ti[... [Tn-i[V^\\ - • .]](a:o,ro), (10) 

Also, define the sequence of optimal policy: tt* = 
{ttq, ..., and sequence of risk-to-go: r* = 








